C++ implementation of crop transform #967
Conversation
    match="out of range",
):
    add_video_stream(decoder, transform_specs="resize, 100, 1000000000000")
New tests start below this line.
FrameDims::FrameDims(int height, int width) : height(height), width(width) {
  TORCH_CHECK(height > 0, "FrameDims.height must be > 0, got: ", height);
  TORCH_CHECK(width > 0, "FrameDims.width must be > 0, got: ", width);
}
Drive-by: when passing in a height and width, we should only be able to instantiate a FrameDims object with positive values. If we want a FrameDims object that has 0 for both values, that's just the default constructor. We should never have a FrameDims object with negative values.
Thanks for the PR @scotts! I left a few comments below; this looks good.
On this design discussion point:
- We apply the crop in RGB space. Note that in the references I generate from the ffmpeg CLI, I explicitly add format=rgb24 before the filters. If we don't explicitly do that format conversion, the frames we get from the ffmpeg CLI will have been cropped in YUV space and we won't get tensor equality with TorchCodec or TorchVision's transforms. I think it would actually be faster to crop first, as then the colorspace conversion needs to do less work. But I think our ideal correctness comparison should be the result we get from passing an unchanged decoded frame to a TorchVision transform.
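For reference, the generation described above boils down to an ffmpeg invocation along these lines. This is a sketch only: the input/output paths and the single-frame flag are illustrative, and only the format=rgb24 conversion placed before crop=...:exact=1 is taken from this discussion.

```python
import subprocess

# Illustrative sketch: convert to RGB *before* cropping, so the crop happens in
# RGB space rather than YUV, matching what TorchCodec/TorchVision produce.
cmd = [
    "ffmpeg",
    "-i", "input.mp4",  # hypothetical input video
    "-vf", "format=rgb24,crop=300:200:50:35:exact=1",
    "-frames:v", "1",  # hypothetical: extract a single reference frame
    "output.bmp",  # hypothetical output path
]
subprocess.run(cmd, check=True)
```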
I don't remember having an explicit discussion about our inclusion criteria, so let me share my personal thoughts on this. When implementing a new TC transform, I think we should consider these two questions:
- A. Perf: Is the TC transform faster than the TV transform? We want the answer to be yes. If the answer is no then it's not worth implementing in TC. Users can just use TV.
- B. Correctness: Can a model tell the difference between a TV transform and its TC counterpart? We want the answer to be no. We know, for example, that for resize we'll have to be careful about this. The easiest way to ensure that is to check that the TC output is exactly equal to the TV output; that's what this PR is doing, and it guarantees correctness with 100% certainty. In general though:
  - we won't always be able to check strict equality with TV.
  - strict equality with TV is a sufficient, but not always necessary, condition to validate correctness. For example, the PIL and Tensor backends of TV's resize slightly differ in their output (usually within a 1-pixel difference for bilinear mode, and a bit more for bicubic mode), but they're still indistinguishable for models. I suspect this will also be the case between TV's resize and TC's resize. And maybe it's OK for crop to have slightly different output results, as long as models can't tell the difference.
Back to our discussion point now: should we apply crop in YUV or in RGB?
I could be wrong, but I think applying the crop in RGB will have the exact same perf as TV's crop. And if that's the case, then we're not meeting criterion A. I think it's worth benchmarking this a bit, and I can help with that if needed.
We know that applying crop in RGB meets criterion B, so if we choose to apply it in YUV we'll have to check that it's still correct. We won't have strict equality with TV, but as I mentioned, this won't always be possible (or necessary!) anyway. Maybe we can have PSNR checks with very strict tolerance. And we should try to validate filtergraph's algorithms (e.g. see if there's any kind of fancy filtering going on, which might be a problem) - that's also something I can help with.
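To make the PSNR idea concrete, a check could look roughly like the minimal sketch below. The psnr helper and the threshold are hypothetical, not part of this PR, and it assumes both frames are same-shaped uint8 RGB tensors.

```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio between two same-shaped uint8 frames, in dB.
    mse = torch.mean((a.float() - b.float()) ** 2)
    if mse == 0:
        return float("inf")
    return (20 * torch.log10(torch.tensor(max_val)) - 10 * torch.log10(mse)).item()

# Hypothetical usage: tc_frame from TorchCodec's crop, tv_frame from TorchVision's crop.
# assert psnr(tc_frame, tv_frame) > 40  # threshold chosen for illustration only
```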
src/torchcodec/_core/custom_ops.cpp
Outdated
// Where "crop" is the string literal and <width>, <height>, <x> and <y> are
// positive integers. Note that in this spec, we are following the
// filtergraph convention of (width, height). This makes it easier to compare it
// against actual filtergraph strings.
IIUC x and y are essentially the top-left coordinates of the crop; let's make that explicit (it could also have been, e.g., the center coordinates of the crop).
Separately, I was hoping we could use H and W everywhere for consistency. Can you share the comparisons against filtergraph strings that are made easier by using the W and H order?
@NicolasHug, I'm fine with (h, w). It just means that in the tests and in the resource generation, we'll have strings that look like:
crop_spec = "crop, 200, 300, 50, 35"  # note: h=200, w=300
crop_filtergraph = "crop=300:200:50:35:exact=1"  # note: w=300, h=200

Comments can cover it. We'll have to deal with this awkwardness somewhere. I don't have a strong preference where, and I think it's probably good to keep all of TorchCodec in (h, w).
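For illustration, a tiny helper like the sketch below could keep the (h, w) to (w, h) swap in one place. The function name, and the assumption that exact=1 is always appended, are hypothetical.

```python
def crop_spec_to_filtergraph(spec: str) -> str:
    # Convert a TorchCodec-style "crop, h, w, x, y" spec into the equivalent
    # FFmpeg filtergraph expression "crop=w:h:x:y:exact=1".
    name, h, w, x, y = (part.strip() for part in spec.split(","))
    assert name == "crop"
    return f"crop={w}:{h}:{x}:{y}:exact=1"

# crop_spec_to_filtergraph("crop, 200, 300, 50, 35") == "crop=300:200:50:35:exact=1"
```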
TORCH_CHECK(x_ <= streamMetadata.width, "Crop x position out of bounds");
TORCH_CHECK(y_ <= streamMetadata.height, "Crop y position out of bounds");
Should/can we account for H and W of the crop as well? E.g.
TORCH_CHECK(x_ + W < streamMetadata.width, "Crop x position out of bounds");
test/generate_reference_resources.py
Outdated
| "1", | ||
| output_path, | ||
| ] | ||
| print("===" + " ".join([str(x) for x in cmd])) |
Is this a debug leftover? Just checking; maybe it's intended and OK since this is just the generation util.
from .utils import assert_frames_equal, NASA_VIDEO, needs_cuda

torch._dynamo.config.capture_dynamic_output_shape_ops = True
I suspect we don't need this line in this file for now, nor the os.environ["TORCH_LOGS"] = "output_code" above, since they relate to torch.compile tests.
@NicolasHug, thanks for the discussion!
I actually think criterion A is too strict, as it's phrased strictly about CPU performance and only considers transforms in isolation. I think we need to also consider:
- Memory usage, not just speed.
- Whether doing the transform during decoding enables or helps transforms that come later in the pipeline.

Regarding cropping in RGB versus YUV space, this is actually a general question that will apply to all transforms. What I think I'd like to go with now is: we default to doing the transform in RGB space for now, as that means we are more likely to be able to have bit-for-bit equality with TorchVision transforms. Whether our users realize it or not, I also think that's what they'll expect. If we get requests for applying the transforms in YUV space, we could provide an API to request it. (Either a decoder-level option, or some special transform which would allow users to have some of their transforms in YUV and some in RGB.)
import torch
from PIL import Image

from .utils import sanitize_filtergraph_expression
Note that pulling in code from utils.py means we have to run this as a module, and since utils.py imports general TorchCodec stuff, we do a full build and install for the reference_resources.yaml workflow. I think that's a fair trade-off to ensure that the unit tests and the resource generation are using the same logic for writing and loading files.
Do we expect to pull in more things from utils.py? If it's just for sanitize_filtergraph_expression, I would have a slight preference for just duplicating it in both files, because the python -m stuff isn't immediately obvious. But that's fine too.
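(For context on the python -m point: with the relative from .utils import, the generation script has to be run as a module, e.g. python -m test.generate_reference_resources rather than python test/generate_reference_resources.py; the exact module path here is assumed from the file location.)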
NicolasHug left a comment:
Thanks @scotts, LGTM
Agreed on both points above that we also care about memory usage, and about enabling transforms that may come later in the pipeline.
I suspect doing the crop in YUV (or even potentially within the decoder itself, like for JPEG) will lead to lower memory usage and to better perf overall. But if, as discussed offline, we reserve the right to change the underlying implementation of the transforms as long as we're confident it won't break models, then having the first implementation in RGB sounds great!
@NicolasHug, yes, I anticipate unifying
Adds C++ implementation for the crop transform.
Design points worth discussing:
- We use exact=1 from the FFmpeg crop filter. Based on reading, this behavior is going to be consistent with torchvision.transforms.v2.functional.crop. Because we're eventually going to accept the TorchVision transform as the public API for crop, users won't be able to set this, so I think we have to be conservative.
- We apply the crop in RGB space. Note that in the references I generate from the ffmpeg CLI, I explicitly add format=rgb24 before the filters. If we don't explicitly do that format conversion, the frames we get from the ffmpeg CLI will have been cropped in YUV space and we won't get tensor equality with TorchCodec or TorchVision's transforms. I think it would actually be faster to crop first, as then the colorspace conversion needs to do less work. But I think our ideal correctness comparison should be the result we get from passing an unchanged decoded frame to a TorchVision transform.

Implementation details:
- Added Transform::validate(const StreamMetadata&) because sometimes we need to validate transforms based on the stream itself. For crop, we need to validate that the (x, y) coordinates are within the original frame's dimensions. (We could let FFmpeg throw an error instead, but that's not a great experience for our users. That will only happen upon a decode, and the error they'll get is not going to be understandable by them.) I wavered on what validate() should accept to do the validation, and I realized it's probably best for it to be StreamMetadata. If in the future what we need to validate against is not in the metadata, we should add it. I'd rather pass that around than the actual AVStream, and we might need to check against more than just the dimensions in the future.
- The reference resources are named using the filter expression. We can use a different name if we want, but it seemed simple to use the filter expression itself. It looks like Windows does not like these filenames, so we'll have to do something else.
- Added a new test file, test_transform_ops.py, and moved all of the existing resize (scale) tests in there as well. My rule is that any test of the core API which sets transform_specs goes in the new test file. While test_ops.py is getting kinda long, my main motivation is that it will simplify internal testing. Internally, we want to limit the places that TorchCodec and TorchVision must be used at the same time due to FFmpeg version issues.
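To make the new core surface concrete, here is a usage sketch. The import path and helper names are assumed from the existing core-ops tests, the asset path is hypothetical, and the spec uses the (h, w) ordering agreed on in the review.

```python
# Sketch only: exercise the crop transform through the core ops, mirroring how
# the resize spec is passed in the existing tests.
from torchcodec._core import add_video_stream, create_from_file  # assumed module path

decoder = create_from_file("nasa_13013.mp4")  # hypothetical test asset path
# "crop, h, w, x, y": h=200, w=300, top-left corner at (x=50, y=35)
add_video_stream(decoder, transform_specs="crop, 200, 300, 50, 35")
```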